Abstract. Feature selection algorithms can reduce the high dimensionality of
textual  cases  and  increase  case-based  task  performance.  However,  conventional 
algorithms  (e.g.,  information  gain)  are  computationally  expensive.  We 
previously  showed  that,  on  one  dataset,  a  rough  set  feature  selection  algorithm 
can  reduce  computational  complexity  without  sacrificing  task  performance. 
Here  we  test  the  generality  of  our  findings  on  additional  feature  selection 
algorithms,  add  one  data  set,  and  improve  our  empirical  methodology.  We 
observed  that  features  of  textual  cases  vary  in  their  contribution  to  task 
performance  based  on  their  part-of-speech,  and  adapted  the  algorithms  to 
include  a  part-of-speech  bias  as  background  knowledge.  Our  evaluation  shows 
that  injecting  this  bias  significantly  increases  task  performance  for  rough  set 
algorithms,  and  that  one  of  these  attained  significantly  higher  classification 
accuracies  than  information  gain.  We  also  confirmed  that,  under  some 
conditions,  randomized  training  partitions  can  dramatically  reduce  training 
times  for  rough  set  algorithms  without  compromising  task  performance. 


1  Introduction 


Textual  case-based  reasoning  (TCBR)  is  a  case-based  reasoning  (CBR)  subfield 
concerned  with  the  use  of  textual  knowledge  sources  (Weber  et  al.,  2005).  TCBR 
systems  differ  in  the  degree  to  which  their  text  content  is  used;  some  are  weakly 
textual  CBR  while  others  are  strongly  textual  CBR,  meaning  that  textual  information 
is  the  focus  of  reasoning  (Wilson  &  Bradshaw,  2000).  Applications  such  as  email 
categorization,  news  categorization,  and  spam  filtering  require  the  use  of  strongly 
textual  CBR  methodologies.  Most  of  these  systems  use  a  bag-of-words  or  term-based 
representation  for  cases  (e.g.,  Wiratunga  et  al.,  2004;  Delany  et  al.,  2005),  which  can 
be  problematic  for  textual  case  bases  that  have  thousands  of  features.  For  example, 
this  huge  dimensionality  could  reduce  accuracies  on  classification  tasks  and/or  result 
in  large  computational  costs. 

A  variety  of  feature  selection  algorithms  can  be  used  to  address  this  issue.  For 
example,  these  include  conventional  algorithms  such  as  document  frequency, 
information  gain,  and  mutual  information  (Yang  &  Pederson,  1997).  Wiratunga  et  al. 
(2004)  extended  these  algorithms  to  include  boosting  and  feature  generalization  with 
considerable success. However, some of these conventional algorithms have high
computational  complexity,  which  can  be  a  problem  when  a  TCBR  system  is  applied 
to  dynamic  decision  environments  that  require  frequent  case  base  maintenance. 

Feature selection algorithms based on rough set theory (RST) rather than
conventional  algorithms  can  potentially  alleviate  this  high  computational  complexity 
and  also  increase  the  task  performance  of  TCBR  systems.  RST  is  a  relatively  novel 
approach  for  decision  making  with  incomplete  information  (Pawlak,  1991).  Feature 
selection algorithms motivated by RST have been applied with much success in
non-textual CBR systems (e.g., Pal & Shiu, 2004). Recently, these algorithms have been
applied to textual data sets. For example, Chouchoulas and Shen (2001) applied a
rough set algorithm called QuickReduct to select features for an email categorization
task. Also, we applied a rough set feature selection algorithm, called Johnson’s
reduct, to a multi-class classification problem (Gupta et al., 2005). We empirically
demonstrated  that  this  algorithm,  for  one  data  set,  was  an  order  of  magnitude  faster 
than  information  gain  and  yet  provided  comparable  classification  performance.  We 
also  introduced  a  methodology  that  randomly  partitions  a  training  set,  and  selects  and 
merges  features  from  each  partition.  This  randomized  training  partitions  procedure 
can  dramatically  reduce  feature  selection  time.  We  showed  that  its  combination  with 
Johnson’s  reduct  was  effective. 

In  this  paper,  we  extend  our  earlier  work  on  feature  selection  for  TCBR 
classification  tasks  by  exploring  additional  rough  set  algorithms.  In  particular,  we 
introduce a variant of Li et al.’s (2006) relative dependency metric, called the
marginal  relative  dependency  metric,  and  explore  its  effectiveness  with  randomized 
training  partitions.  In  addition,  we  introduce  the  notion  of  part-of-speech  bias  in 
textual  case  bases.  This  is  based  on  our  observation  that  textual  features  with  different 
parts  of  speech  may  inherently  differ  in  their  ability  to  contribute  to  reasoning.  For 
example,  noun  features  may  contribute  more  than  verb  features,  as  described  in 
Section  3.4.  Adapting  rough  set  and  conventional  feature  selection  algorithms  to 
incorporate  this  bias  could  improve  their  performance.  We  empirically  investigate 
these  issues  on  two  data  sets. 

The  rest  of  this  paper  is  organized  as  follows.  Section  2  introduces  RST  and  two  of 
its  derivative  feature  selection  algorithms.  We  also  include  a  description  of 
randomized  training  partitions  and  introduce  the  notion  of  part-of-speech  bias.  We 
present  an  empirical  evaluation  of  the  feature  selection  algorithms  and  their 
interaction  with  randomized  training  partitions  and  part-of-speech  bias  in  Section  3. 
We  review  related  work  on  feature  selection  in  Section  4  and  conclude  with  a 
discussion  of  our  plans  for  future  research  in  Section  5. 


2  Rough  Set  Theoretic  Feature  Selection 

2.1  Building  Blocks  of  Rough  Set  Theory 

For  the  sake  of  clarity  for  this  audience,  we  use  established  CBR  terminology,  such  as 
cases  and  features,  to  present  the  elements  of  RST.  RST  is  based  on  a  formal 
description of an information system (Pawlak, 1991). An information system $S$ is a
tuple $S = \langle C, F, V \rangle$ where:

$C = \{c_1, c_2, \ldots, c_n\}$ denotes a non-empty, finite set of cases,

$F = \{f_1, f_2, \ldots, f_m\}$ denotes a non-empty, finite set of features (or attributes), and

$V = \{V_1, V_2, \ldots, V_m\}$ is the set of value domains for the features in $F$.


A decision table is a special case of an information system where we distinguish two
kinds of features: (1) a class (or decision) feature $f_d$, and (2) the standard conditional
features $F_p$, which are used to predict the class of a case. Therefore, $F = F_p \cup \{f_d\}$.


Table 1. A case base fragment for hiring decisions

Cases        f1 = age   f2 = experience   f3 = grades   fd = hired
c1 = Anna    21-30      none              good          yes
c2 = Bill    21-30      none              good          no
c3 = Cathy   21-30      4-6               average       no
c4 = Dave    31-40      1-3               excellent     yes
c5 = Emma    31-40      4-6               good          yes
c6 = Frank   31-40      4-6               good          yes


We  will  explain  RST  concepts  using  the  trivial  case  base  in  Table  1,  which  pertains 
to  making  hiring  decisions  based  on  three  features.  Central  to  RST  is  the  notion  of 
indiscernibility. Examining the cases in Table 1, we see that cases $c_1$ = Anna and
$c_2$ = Bill have identical values for all the features, and thus are indiscernible with
respect to the three conditional features $f_1$, $f_2$, and $f_3$. More broadly, a set of cases $C'$ is
indiscernible with respect to a set of features $F' \subseteq F$ if the following is true:

$$\mathrm{IND}(F', C) = \{\, C' \subseteq C \mid \forall f \in F',\ \forall c_i, c_j\,(i \neq j) \in C':\ f(c_i) = f(c_j) \,\} \qquad (1)$$

Thus,  two  cases  are  indiscernible  with  respect  to  features  in  F'  if  they  have  identical 
values  for  all  the  features  in  F'. 

An  indiscernibility  relation  is  an  equivalence  relation  that  partitions  the  set  of  cases 
into  equivalence  classes.  Each  equivalence  class  contains  a  set  of  indiscernible  cases 
for  the  given  set  of  features  F'.  For  example,  given  the  hiring  decision  table: 

$$\mathrm{IND}(F', C) = \{\{c_1, c_2\}, \{c_3\}, \{c_4\}, \{c_5, c_6\}\}$$

where $F' = \{age, experience, grades\}$ and $C = \{c_1, c_2, c_3, c_4, c_5, c_6\}$. The equivalence class
of a case $c$ with respect to selected features $F'$ is denoted by $[c]_{F'}$. Based on the
equivalence classes, RST develops two kinds of set approximations. First, given sets
$C' \subseteq C$ and $F' \subseteq F$, the lower approximation of $C'$ with respect to $F'$ is defined as:

$$\mathrm{lower}(C, F', C') = \{\, c \in C \mid [c]_{F'} \subseteq C' \,\} \qquad (2)$$

or the collection of cases whose equivalence classes are subsets of $C'$. Second, the
upper approximation of $C'$ with respect to $F'$ is instead defined as:

$$\mathrm{upper}(C, F', C') = \{\, c \in C \mid [c]_{F'} \cap C' \neq \emptyset \,\} \qquad (3)$$

or the collection of cases whose equivalence classes have a non-empty intersection
with $C'$. A set of cases $C'$ is crisp (or definable) if $\mathrm{lower}(C, F', C') = \mathrm{upper}(C, F', C')$,
and is otherwise rough.

For example, in the hiring decision table, consider $C'_{\{hired=yes\}} = \{c_1, c_4, c_5, c_6\}$; then
the lower and upper approximations of $C'_{\{hired=yes\}}$ with respect to $F' = \{age,
experience, grades\}$ are:

$$\mathrm{lower}(C, F', C'_{\{hired=yes\}}) = \{c_4, c_5, c_6\} \quad \text{and} \quad \mathrm{upper}(C, F', C'_{\{hired=yes\}}) = \{c_1, c_2, c_4, c_5, c_6\}$$

Case $c_1$ is not included in the lower approximation because its equivalence class
$\{c_1, c_2\}$ is not a subset of $C'_{\{hired=yes\}}$. However, it is included in the upper
approximation because its equivalence class has a non-empty intersection with
$C'_{\{hired=yes\}}$.


Another important RST element is the notion of a set called the positive region.
The positive region of a decision feature $f_d$ with respect to $F' \subseteq F$ is defined as:

$$\mathrm{POS}_{F'}(f_d, C) = \bigcup \{\, \mathrm{lower}(C, F', C') \mid C' \in \mathrm{IND}(\{f_d\}, C) \,\} \qquad (4)$$

or the collection of the $F'$-lower approximations corresponding to all the equivalence
classes of $f_d$. For example, the positive region of $f_d$ (hired) with respect to $F' = \{age,
experience, grades\}$, where $\mathrm{lower}(C, F', C'_{\{hired=no\}}) = \{c_3\}$, is as follows:

$$\mathrm{POS}_{F'}(f_d, C) = \mathrm{lower}(C, F', C'_{\{hired=yes\}}) \cup \mathrm{lower}(C, F', C'_{\{hired=no\}}) = \{c_3, c_4, c_5, c_6\}$$

The positive region can be used to develop a measure of a feature’s ability to
contribute information for decision making. A feature $f \in F'$ makes no contribution, or
is dispensable, if $\mathrm{POS}_{F'}(f_d, C) = \mathrm{POS}_{F' - \{f\}}(f_d, C)$, and is indispensable otherwise. That is,
removing a dispensable feature $f$ from $F'$ does not change the positive region of the decision
feature. Therefore, features can be selected by checking whether they are
indispensable with respect to a decision variable. A minimal set of features $F' \subseteq F$
is called a reduct if $\mathrm{POS}_{F'}(f_d, C) = \mathrm{POS}_{F}(f_d, C)$.
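
The definitions in Equations 1 to 4 can be checked against Table 1 with a small sketch. This is an illustrative Python rendering only, not the implementation used in our experiments; the dictionary layout and the function names (`ind`, `lower`, `upper`, `positive_region`, `is_dispensable`) are assumptions made for the example.

```python
from collections import defaultdict

# Table 1, keyed by case name; the dictionary layout is an illustrative choice.
CASES = {
    "c1": {"age": "21-30", "experience": "none", "grades": "good", "hired": "yes"},
    "c2": {"age": "21-30", "experience": "none", "grades": "good", "hired": "no"},
    "c3": {"age": "21-30", "experience": "4-6", "grades": "average", "hired": "no"},
    "c4": {"age": "31-40", "experience": "1-3", "grades": "excellent", "hired": "yes"},
    "c5": {"age": "31-40", "experience": "4-6", "grades": "good", "hired": "yes"},
    "c6": {"age": "31-40", "experience": "4-6", "grades": "good", "hired": "yes"},
}


def ind(features, cases):
    """Equivalence classes of the cases under the given features (Equation 1)."""
    groups = defaultdict(set)
    for name, row in cases.items():
        groups[tuple(row[f] for f in features)].add(name)
    return list(groups.values())


def lower(cases, features, target):
    """Cases whose equivalence class is a subset of `target` (Equation 2)."""
    return {c for block in ind(features, cases) if block <= target for c in block}


def upper(cases, features, target):
    """Cases whose equivalence class intersects `target` (Equation 3)."""
    return {c for block in ind(features, cases) if block & target for c in block}


def positive_region(features, decision, cases):
    """Union of the lower approximations of every decision class (Equation 4)."""
    region = set()
    for decision_class in ind([decision], cases):
        region |= lower(cases, features, decision_class)
    return region


def is_dispensable(feature, features, decision, cases):
    """True if dropping `feature` leaves the positive region unchanged."""
    rest = [f for f in features if f != feature]
    return positive_region(rest, decision, cases) == positive_region(features, decision, cases)
```

With `F = ["age", "experience", "grades"]`, `positive_region(F, "hired", CASES)` reproduces the positive region derived above. Note that a feature can be dispensable on its own while no longer being dispensable once another feature has been removed, which is one reason a case base can have several reducts.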

Often,  an  information  system  has  more  than  one  possible  reduct.  Generating  a 
reduct of minimal length is an NP-hard problem. Therefore, in practice, algorithms
have  been  developed  to  generate  one  “good”  reduct.  Next,  we  present  our  adaptations 
of  two  such  algorithms:  (1)  Johnson’s  heuristic  algorithm  and  (2)  the  marginal  relative 
dependency  algorithm. 


2.2  Feature  Selection  with  Johnson’s  Heuristic  Algorithm 

We adapted Johnson’s (1974) heuristic to compute reducts as follows. It sequentially
selects features by finding those that are most discernible for a given decision feature
(see Figure 1). It computes a discernibility matrix $M$, where each cell $m_{ij}$ of the matrix
corresponding to cases $c_i$ and $c_j$ includes the conditional features in which the two
cases’ values differ. Formally, we define strict discernibility as:

$$m_{ij} = \begin{cases} \{\, f \in F_p : f(c_i) \neq f(c_j) \,\} & \text{if } f_d(c_i) \neq f_d(c_j) \\ \emptyset & \text{otherwise} \end{cases} \qquad (5)$$

    JohnsonsReduct(Fp, fd, C)

    Input   Fp: conditional features, fd: decision feature, C: cases
    Output  R: reduct, R ⊆ Fp

     1  R ← ∅, F' ← Fp
     2  M ← computeDiscernibilityMatrix(C, F', fd)
     3  do
     4      fh ← selectHighestScoringFeature(M)
     5      R ← R ∪ {fh}
     6      for (i = 0 to |C|, j = i to |C|)
     7          mij ← ∅ if fh ∈ mij
     8      F' ← F' − {fh}
     9  until mij = ∅ ∀ i, j
    10  return R

Figure 1. Pseudocode for Johnson’s heuristic algorithm

Given such a matrix $M$, for each feature, the algorithm counts the number of cells in
which it appears. The feature $f_h$ with the highest number of entries is selected for
addition to the reduct $R$. Then all the entries $m_{ij}$ that contain $f_h$ are removed and the
next best feature is selected. This procedure is repeated until $M$ is empty.
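
The procedure in Figure 1 can be sketched in Python as follows, using strict discernibility (Equation 5) on the case base of Table 1. This is a minimal illustration under stated assumptions rather than our experimental implementation: the matrix is stored as a dictionary keyed by case pairs, empty cells are simply omitted, and ties between equally scoring features are broken by the order of the feature list, which Figure 1 does not specify.

```python
from collections import Counter
from itertools import combinations

FEATURES = ["age", "experience", "grades"]
ROWS = {  # Table 1: (age, experience, grades, hired)
    "c1": ("21-30", "none", "good", "yes"),
    "c2": ("21-30", "none", "good", "no"),
    "c3": ("21-30", "4-6", "average", "no"),
    "c4": ("31-40", "1-3", "excellent", "yes"),
    "c5": ("31-40", "4-6", "good", "yes"),
    "c6": ("31-40", "4-6", "good", "yes"),
}
CASES = {name: dict(zip(FEATURES + ["hired"], row)) for name, row in ROWS.items()}


def discernibility_matrix(cases, features, decision):
    """Non-empty cells m_ij: features that differ for pairs with different decisions."""
    matrix = {}
    for a, b in combinations(sorted(cases), 2):
        if cases[a][decision] != cases[b][decision]:
            differing = frozenset(f for f in features if cases[a][f] != cases[b][f])
            if differing:
                matrix[(a, b)] = differing
    return matrix


def johnsons_reduct(cases, features, decision):
    """Greedily pick the feature that appears in the most remaining cells."""
    matrix = discernibility_matrix(cases, features, decision)
    reduct = []
    while matrix:
        counts = Counter(f for cell in matrix.values() for f in cell)
        # Assumption: ties go to the feature listed first in `features`.
        best = max(features, key=lambda f: counts[f])
        reduct.append(best)
        # Remove every cell that the selected feature already discerns.
        matrix = {pair: cell for pair, cell in matrix.items() if best not in cell}
    return reduct
```

On Table 1, `johnsons_reduct(CASES, FEATURES, "hired")` first selects `age`, which appears in six cells, and then needs one more feature to discern Anna from Cathy. The pair Anna and Bill contributes an empty cell because no conditional feature separates them.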

The computational complexity of JohnsonsReduct is $O(VC^2)$, where $V$ is the
(typically large) vocabulary size and bounds the number of times the do loop is
executed, and $C$ here denotes the number of cases. However, this is a loose upper bound
that is better approximated by $O(RC^2)$, where $R$ is the reduct size and $R \ll V$. Comparing
this complexity with the computational complexity of information gain, which is
$O(MVC)$, where $M$ is the number of classes, the complexity of JohnsonsReduct is lower
because, typically, $RC < MV$. However, the worst case space complexity of
JohnsonsReduct is $O(VC^2)$, which is significantly greater than information gain’s
space complexity of $O(VC)$.

In TCBR applications, each case may have only a small subset of features. Strict
discernibility could be implemented as follows: $f(c_i) \neq f(c_j)$ if only one of the cases $c_i$
or $c_j$ contains the term denoted by the feature $f$. However, such an approach ignores
the information contained in the variation of term frequencies (i.e., values) across
cases. Hence, a graded or fuzzy notion of indiscernibility, instead of a strict notion,
may be more effective (e.g., Skowron, 1995). We extend strict discernibility to
graded or fuzzy discernibility using a similarity computation as follows. In Equation
5, we consider:

$$f(c_i) \neq f(c_j), \text{ when } \mathrm{sim}(f(c_i), f(c_j)) < \tau_f \qquad (6)$$

where $\tau_f$ ($0 < \tau_f < 1$) is a user-defined similarity threshold. We adapt a similarity measure
for ordinal scales (Montazemi & Gupta, 1997) to compute the similarity between two
non-zero frequency valued features as follows:

$$\mathrm{sim}(f(c_i), f(c_j)) = \begin{cases} 1 - \dfrac{\mathrm{abs}(f(c_i) - f(c_j))}{\eta_f \cdot \sigma_f} & \text{when } \mathrm{abs}(f(c_i) - f(c_j)) < \eta_f \cdot \sigma_f \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $\sigma_f$ is the standard deviation of non-zero frequency values for feature $f$ and
$\eta_f > 0$ is a user-defined parameter for adjusting similarity sensitivity. For example, for a
feature $f$ with $\sigma_f = 1.87$ and $\eta_f = 1$,

$$\mathrm{sim}(4, 5) = 1 - \mathrm{abs}(4 - 5) / (1.87 \times 1) = 0.465$$
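
Equations 6 and 7 translate directly into a few lines of Python. The sketch below is illustrative; the function and parameter names (`graded_similarity`, `discernible`, `sigma`, `eta`, `tau`) are our own labels for $\mathrm{sim}$, $\sigma_f$, $\eta_f$, and $\tau_f$.

```python
def graded_similarity(x, y, sigma, eta=1.0):
    """Equation 7: similarity of two non-zero frequency values of one feature."""
    scale = eta * sigma
    gap = abs(x - y)
    return 1.0 - gap / scale if gap < scale else 0.0


def discernible(x, y, sigma, tau, eta=1.0):
    """Equation 6: two values count as different when similarity is below tau."""
    return graded_similarity(x, y, sigma, eta) < tau


# Worked example from the text: sigma_f = 1.87, eta_f = 1.
print(round(graded_similarity(4, 5, sigma=1.87), 3))
```

Raising `eta` widens the range of frequency differences that still count as partially similar, so fewer case pairs are treated as discernible for a fixed `tau`.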

Similarly, the issue of class feature discernibility arises in TCBR for multi-class
classification tasks in which more than one class can be assigned to a case. For
example, topic assignment is a multi-class classification task. In Equation 5, we
consider:

$$f_d(c_i) \neq f_d(c_j), \text{ when } \mathrm{sim}(f_d(c_i), f_d(c_j)) < \tau_d \qquad (8)$$

where $f_d(c_i)$ can be a set of values, $\mathrm{sim}(f_d(c_i), f_d(c_j))$ yields the ratio of the size of
the intersection of their values to the size of their union, and $\tau_d$ ($0 < \tau_d < 1$) is a
user-defined similarity threshold.
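
The class similarity in Equation 8 is the ratio of shared labels to all labels of the two cases. A minimal Python sketch follows; the function names are illustrative, and returning 1.0 when both label sets are empty is an assumption that the text does not address.

```python
def class_similarity(labels_a, labels_b):
    """Ratio of the intersection of two label sets to their union."""
    a, b = set(labels_a), set(labels_b)
    union = a | b
    # Assumption: two cases without any labels are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def classes_discernible(labels_a, labels_b, tau_d):
    """Equation 8: decision values differ when similarity is below tau_d."""
    return class_similarity(labels_a, labels_b) < tau_d
```

For example, two news items labeled {grain, wheat} and {grain, corn} share one of three distinct topics, so their similarity is 1/3 and they are discernible for any $\tau_d$ above that value.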


2.3  Feature  Selection  using  Marginal  Relative  Dependency 

In Section 2.1, we described how an indiscernibility (or equivalence) relation
partitions a case base $C$ into equivalence classes with respect to a set of features $F'$.
Intuitively, with an increase in the number of features in $F'$, we expect the number of
equivalence classes to increase and each equivalence class to contain fewer cases. The
degree of relative dependency of a set of features $F'$ builds on this intuition. For a
decision feature $f_d$ and a set of features $F'$, it is defined as (Li et al., 2006):

$$\delta_{F'}^{f_d} = \frac{|\Pi_{F'}(C)|}{|\Pi_{F' \cup \{f_d\}}(C)|} \qquad (9)$$

where $\Pi_{F'}(C)$ is the set of equivalence classes generated over $C$ with respect to
features $F'$ and $\Pi_{F' \cup \{f_d\}}(C)$ is the set of equivalence classes generated over $C$
with respect to features $F' \cup \{f_d\}$. Clearly, the maximum value of $\delta_{F'}^{f_d}$ is 1. Based
on this measure, we compute the marginal contribution of a feature $f$ (i.e., marginal
relative dependency), denoted by $\mu_f$, as follows:

$$\mu_f = \delta_{F' \cup \{f\}}^{f_d} - \delta_{F'}^{f_d} \qquad (10)$$

In addition to using $\mu_f$ as a metric for selecting features, it can also be used as a
feature weight because $\sum_{f \in R} \mu_f = \delta_{R}^{f_d}$, where $R$ is a reduct.

Our variation on this reduct computation algorithm, called the Marginal Relative
Dependency algorithm (MRD), is as follows (see Figure 2). At each iteration, it
computes the marginal relative dependency of all the candidate features $F'$, selects the
feature $f_m$ with the maximum marginal relative dependency, and adds it to the reduct
$R$. The algorithm terminates when the relative dependency $\delta_R = \beta$, where $\beta$ is a
user-defined parameter in the range ($0 < \beta < 1$). In a TCBR application, it is possible that
beyond a certain point both $\mu_f$ and $\delta_R$ may behave asymptotically. Therefore, $\beta$
can be specified to terminate the feature selection process early.


    MRD(Fp, fd, C)

    Input   Fp: conditional features, fd: decision feature, C: cases, β: threshold
    Output  R: reduct, R ⊆ Fp

    1  R ← ∅, F' ← Fp, δR ← 0
    3  do
    4      <fm, μm> ← selectMaximallyContributingFeatureAndValue(F', C)
    5      R ← R ∪ {fm}
    6      F' ← F' − {fm}
    7      δR ← δR + μm
    8  until δR = β
    9  return R

Figure  2.  Pseudocode  for  the  Marginal  Relative  Dependency  algorithm  (MRD) 

Like JohnsonsReduct, the determination of equivalence classes in MRD can be based
on a strict or graded notion of discernibility. For the graded notion of discernibility we
apply Equations 6, 7, and 8.
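
A compact Python sketch of Figure 2 with strict discernibility, run on the case base of Table 1, is given below. It is illustrative only. Two choices in it are assumptions rather than part of the figure: the loop stops once $\delta_R$ reaches or exceeds $\beta$ (or the candidates are exhausted), which avoids relying on exact floating-point equality, and ties are broken by candidate order. The returned weights are the $\mu_f$ values of the selected features.

```python
FEATURES = ["age", "experience", "grades"]
ROWS = {  # Table 1: (age, experience, grades, hired)
    "c1": ("21-30", "none", "good", "yes"),
    "c2": ("21-30", "none", "good", "no"),
    "c3": ("21-30", "4-6", "average", "no"),
    "c4": ("31-40", "1-3", "excellent", "yes"),
    "c5": ("31-40", "4-6", "good", "yes"),
    "c6": ("31-40", "4-6", "good", "yes"),
}
CASES = {name: dict(zip(FEATURES + ["hired"], row)) for name, row in ROWS.items()}


def count_classes(cases, features):
    """Number of equivalence classes of the cases under `features`."""
    return len({tuple(row[f] for f in features) for row in cases.values()})


def relative_dependency(cases, features, decision):
    """Equation 9: classes under F' divided by classes under F' plus f_d."""
    return count_classes(cases, features) / count_classes(cases, features + [decision])


def mrd(cases, features, decision, beta):
    """Greedy selection by marginal relative dependency (Figure 2)."""
    reduct, weights = [], {}
    candidates, delta_r = list(features), 0.0
    # Assumption: stop at delta_r >= beta instead of exact equality.
    while candidates and delta_r < beta:
        # Equation 10, with delta_r starting at 0 as in line 1 of Figure 2.
        mu = {f: relative_dependency(cases, reduct + [f], decision) - delta_r
              for f in candidates}
        best = max(candidates, key=mu.get)  # ties: first candidate wins
        reduct.append(best)
        candidates.remove(best)
        weights[best] = mu[best]
        delta_r += mu[best]
    return reduct, weights
```

On Table 1, `grades` has the highest initial relative dependency (3/4). Because Anna and Bill are indiscernible but differently labeled, the relative dependency never exceeds 4/5 here, which illustrates why a threshold $\beta$ below 1 is useful on noisy case bases.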


The worst case computational complexity of MRD is $O(RVC^2)$. For large textual
case bases, this is an order of magnitude more complex than JohnsonsReduct and
information gain. However, its worst case space complexity is only $O(VC)$.


2.4  Feature  Selection  with  Random  Training  Set  Partitions 

The  computational  complexities  of  the  feature  selection  algorithms  discussed  above 
depend  on  C,  the  number  of  training  cases.  The  complexities  of  both  RST  approaches, 
JohnsonsReduct  and  MRD,  are  a  function  of  the  square  of  the  number  of  training 
cases.  Therefore,  reducing  the  number  of  training  cases  that  need  to  be  considered  at 
one  time  can  dramatically  reduce  feature  selection  and  training  time.  We  can 
accomplish this by using randomized training partitions (RTP) (Gupta et al., 2005),
which is a procedure with the following steps:

1. Randomly create m equal-sized partitions of the training set.

2.  From  each  partition,  select  features  using  a  feature  selection  algorithm  (e.g., 
JohnsonsReduct  or  MRD). 

3.  Define  the  final  feature  set  as  the  union  of  features  selected  from  each 
partition. 

This  approach  could  reduce  the  training  time  by  a  factor  of  m  for  the  RST  feature 
selection  algorithms. 
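
The three RTP steps can be sketched in Python as follows. The function name, the `select_features` callback (standing in for JohnsonsReduct or MRD applied to one partition), and the `seed` parameter are assumptions made for the illustration.

```python
import random


def randomized_training_partitions(case_ids, m, select_features, seed=0):
    """Select features per random partition and return their union."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)             # step 1: random assignment
    partitions = [ids[i::m] for i in range(m)]   # sizes differ by at most one
    selected = set()
    for part in partitions:
        selected |= set(select_features(part))   # steps 2 and 3: select, merge
    return selected
```

The factor-of-m estimate follows from the quadratic dependence on the number of cases: with $C$ training cases, m partitions of $C/m$ cases each cost on the order of $m \cdot (C/m)^2 = C^2/m$ pairwise comparisons instead of $C^2$.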


2.5  POS-Biaser:  A  Part-of-speech  Bias  Adjustment  Method 

In  TCBR,  words  or  terms  are  typically  used  as  features.  The  linguistic  attributes 
associated  with  such  features  (e.g.,  part-of-speech  (POS),  syntactic  roles)  could 
impact  feature  selection  and  TCBR  task  performance.  For  example,  it  is  likely  that 
noun  features  are  generally  more  informative  than  verb  features  possibly  because 
nouns  are  an  open  class  of  words,  whereas  verbs,  adjectives,  adverbs,  prepositions, 
and  pronouns  are  closed  classes  of  words  (Quirk  et  al.,  1985).  Open  word  classes  are 
frequently  extended  to  include  new  words,  whereas  closed  classes  are  rarely 
extended.  Thus,  a  large  percentage  of  terms  in  a  typical  vocabulary  are  nouns. 
However,  each  noun  feature  may  occur  in  relatively  fewer  cases  and  has  the  potential 
to  be  more  informative  towards  a  decision.  In  contrast,  verbs  tend  to  occur  more 
frequently  across  many  cases.  Also,  there  is  considerable  flexibility  in  the  choice  of 
verbs  used  to  express  the  case  content.  This  causes  variability  in  verb  expressions  that 
could  be  inappropriately  construed  as  informative  (e.g.,  by  information-theoretic 
measures)  and  as  a  result  may  be  favored  by  feature  selection  algorithms.  For 
example,  this  would  adversely  affect  JohnsonsReduct,  which  relies  on  pair-wise  case 
comparisons to construct a discernibility matrix. It is likely to select spurious verbs,
as  could  MRD  and  information  gain  (IG)  (Yang  &  Pederson,  1997). 

One  way  to  counter  the  effect  of  this  inherent  potential  bias  of  textual  case  bases  is 
to  bias  the  feature  selection  algorithms  accordingly.  Thus,  we  introduce  a  simple 
methodology,  called  POS-Biaser,  to  use  in  combination  with  a  feature  selection 
algorithm.  POS-Biaser  assumes  that  part-of-speech  tagging  is  performed  during  the 
case  indexing  process.  This  is  feasible  because  part  of  speech  taggers  are  publicly 
available (e.g., Brill, 1993). POS-Biaser uses a POS biasing factor $p_{pos}$ for each POS
along with a feature selection metric to select features. For example, when $p_{noun} = 1.8$,
$p_{verb} = 0.6$, $p_{adjective} = 1$, and $p_{adverb} = 0.3$, the feature selection algorithm’s values for
nouns are inflated to 1.8 times their original value, the values for verbs are deflated to
0.6 times their original value, and so on.

The POS-Biased JohnsonsReduct includes a modification to the step that executes
selectHighestScoringFeature(M) (Figure 1, line 4), which computes the number of
cell entries as the score of each feature (i.e., the feature selection metric). In
particular, feature scores are now multiplied by their respective $p_{pos}$ values. This
would bias JohnsonsReduct to select more noun features than its unbiased version.
Likewise, we accommodate a POS bias in MRD by similarly modifying the statement
that executes selectMaximallyContributingFeatureAndValue(F',C).
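
The biasing step amounts to rescaling each feature's score by the factor of its POS tag before the maximum is taken. The Python sketch below is illustrative: the function names, the example terms, and the default factor of 1.0 for tags without an entry are assumptions, while the factor values are the example values given above.

```python
POS_BIAS = {"noun": 1.8, "verb": 0.6, "adjective": 1.0, "adverb": 0.3}


def pos_biased_scores(raw_scores, pos_tags, bias=POS_BIAS):
    """Multiply each feature's selection score by the factor of its POS tag."""
    # Assumption: tags without an entry in `bias` keep their original score.
    return {f: s * bias.get(pos_tags[f], 1.0) for f, s in raw_scores.items()}


def select_highest_scoring_feature(raw_scores, pos_tags, bias=POS_BIAS):
    """Biased variant of the selection step (Figure 1, line 4)."""
    scored = pos_biased_scores(raw_scores, pos_tags, bias)
    return max(scored, key=scored.get)


# Hypothetical cell counts for three terms from a discernibility matrix.
counts = {"merger": 10, "say": 20, "quickly": 30}
tags = {"merger": "noun", "say": "verb", "quickly": "adverb"}
print(select_highest_scoring_feature(counts, tags))
```

In this made-up example the noun wins after biasing even though it has the lowest raw count, which is the intended effect of the noun factor being greater than 1.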


3  Evaluation 


3.1  Claims  and  Empirical  Methodology 

We  evaluated  the  feature  selection  algorithms  described  in  this  paper  to  explore  the 
following  hypotheses: 

1.  Rough  set  methods  perform  as  well  as  or  outperform  information  gain  on  our 
case-based  classification  tasks. 

2.  The  performances  of  rough  set  feature  selection  algorithms  are  affected  by  the 
POS  bias  in  textual  case  bases. 

3.  RTP  is  an  effective  way  to  dramatically  reduce  feature  selection  time  without 
compromising  case-based  task  performance. 

We selected both a single and a multi-classification task to evaluate the utility of
the feature selection and POS-biasing algorithms for a simple case-based classifier.
Single classification involves assigning exactly one class label to a new text case,
while multi-classification involves assigning one or more class labels. For example,
sorting emails into a known set of folders is a single classification task and assigning
one or more topics to news articles is a multi-classification task.

We selected tasks from two data sets, one for each type of classification task. The first
data set is Reuters-21578 (Reuters, 2006); it contains news items and its
multi-classification task concerns assigning topics to these items. The second data set is a
subset  of  20-News  Groups  (Lang,  2006);  it  contains  news  group  emails  and  its  single 
classification  task  concerns  assigning  a  news  group  label  to  each  of  these  emails.  Due 
to  the  relatively  high  computational  and  space  complexities  of  the  algorithms  being 
tested,  we  selected  only  the  first  ten  news  groups  for  evaluation  in  this  data  set;  we 
call  this  10-News  Groups.  Table  2  summarizes  the  characteristics  of  both  data  sets. 


Table 2. A summary of the characteristics of the data sets used in the experiments

Characteristic           Reuters-21578                       10-News Groups
Number of Cases          11,330 (with more than 0 topics)    10,013
Number of Classes        110                                 10
Num. Cases per Class     103 (Avg.)                          1001.3 (Avg.)
Num. Classes per Case    1.26 (Avg.), 1 (min.), 16 (max.)    1
Num. Words per Case      137 (Avg.)                          200.35 (Avg.)


We  used  two  rough  set  feature  selection  algorithms  (JohnsonsReduct  (JR)  and 
MRD)  and  one  conventional  feature  selection  algorithm,  namely  IG  (Yang  & 
Pederson,  1997).  In  the  experiments,  for  a  fair  comparison,  we  ensured  that  all  the 
algorithms  selected  the  same  number  of  features,  and  used  JR  to  determine  how  many 
features  to  select.  Finally,  we  also  incorporated  the  POS  bias  in  each  feature  selection 
algorithm,  and  refer  to  them  as  JRB,  MRDB,  and  IGB,  respectively. 

Our  feature  generation  algorithm  performs  tokenization,  POS  tagging,  and 
morphotactic  parsing  to  create  POS-tagged  terms  as  features.  Morphotactic  parsing  is 
a  more  involved  method  than  simple  stemming;  it  reduces  terms  to  their  baseforms 
even  across  different  POS  (Gupta  &  Aha,  2004).  For  example,  it  reduces  the  noun 
“computer”  to  the  verb  “compute”.  Features  with  document  frequency  greater  than 
two  were  considered  for  feature  selection. 

We  applied  a  k-nearest  neighbor  classifier  with  the  fuzzy  feature  similarity  function 
described  in  Equation  7  to  evaluate  classification  performance  using  the  selected 
features. (We set k = 5 based on feedback from our initial empirical studies.) All
features  were  weighted  equally  to  isolate  the  selection  behaviors  of  the  feature 
selection  algorithms  in  our  experiments.  Multi-classification  task  performance  was 
measured using 11-point average precision, which is the average precision obtained at
recall thresholds of (0%, 10%, ..., 100%). The classifier assigns as many topics as
needed  until  a  given  recall  is  achieved  (Yang  &  Pederson,  1997).  Performance  on  the 
single  classification  task  was  measured  as  classification  accuracy.  We  also  measured 
feature  selection  time  (in  seconds)  for  each  algorithm. 

We  used  a  two-fold  cross  validation  strategy  to  evaluate  the  algorithms.  Two  sets  of 
two  folds  were  randomly  created.  For  RTP,  all  the  algorithms  were  run  with  the  same 
set  of  10,  20,  30,  and  40  randomized  training  partitions  in  each  fold.  We  did  not 
experiment without partitions due to the RST algorithms' high computational and
memory  requirements. 


3.2  Empirical  Results 

Results with the Reuters-21578 Data Set. The key results for the six algorithms
(i.e., JR, IG, MRD, JRB, IGB, and MRDB) on this data set are shown in Figures 3-5. JR
selected an average of 95.5, 118, 135, and 139.5 features with 10, 20, 30, and 40
partitions, respectively. Increasing the number of RTP partitions increases the chance of
selecting different features in different partitions, which increases the total number of
unique features selected.

We comparatively analyzed the algorithms’ precision results using one-tailed paired
Student's t-tests. Comparisons of the feature selection algorithms’ unbiased versions show
that JR significantly outperformed IG for every number of partitions tested (e.g., 76.72%
vs. 70.17% at 10 partitions [p=.0006]), as did MRD (e.g., 79.21% vs. 75.86% at 40 partitions
[p=.0018]). Therefore, both the rough set feature selection methods significantly
outperformed a conventional feature selection method. In addition, MRD significantly
outperformed JR at partitions of 30 and 40 (e.g., 79.20% vs. 77.83% at 40 partitions
[p=.0003]), but the reverse was true for 10 partitions.

Comparing the POS-biased versions of the feature selection algorithms with their
respective unbiased versions shows that JRB and IGB outperform JR and IG
respectively at all RTP sizes. For example, at 30 partitions, JRB significantly
outperforms JR (82.61% vs. 77.26% [p=.0007]) and IGB significantly outperforms IG
(76.84% vs. 74.79% [p=.0019]). However, MRDB significantly outperforms MRD
only at 10 and 20 partitions; for 30 and 40 partitions there was no significant
difference. Overall, POS bias had a positive effect on all the feature selection
algorithms, including IG. It was most effective with JR, whose classification accuracy
improved by 6.1% on average versus its unbiased version. Finally, when adjusted for POS
bias, JR recorded significantly higher precision results than the other feature selection
algorithms we tested.

Figure 4. The effect of POS-bias on the number of noun features selected by the three
algorithms for Reuters-21578 using 10 RTP partitions


Figure 4 shows the effect of POS bias on the three feature selection algorithms for
Reuters-21578 at 10 partitions. The proportions of noun features without bias were at
comparable levels for JR and IG (each at 71%) and slightly lower for MRD (64%). With this
bias, the proportion of noun features increased to 93% for JR and IG and 94% for MRD. The
increase in the proportion of noun features was comparable and consistent across the
three algorithms, yet its effect on JR’s precision performance was most substantial. Thus,
we conclude that JR is most sensitive to POS bias.

Figure 5. Feature selection times (Reuters-21578)

Figure  5  shows  the  feature  selection  times  for  IG,  JR,  IGB,  and  JRB.  JR  has  the 
lowest  feature  selection  time.  It  decreased  by  81.92%  from  510  seconds  at  10 
partitions  to  92  seconds  at  40  partitions,  without  decreasing  average  precision, 
demonstrating  that  RTP  is  highly  effective.  Its  biased  version  (JRB)  has  higher  feature 
selection  times  (10,382  sec.  at  10  to  738  sec.  at  40  partitions)  but  achieves  a  similar 
decrease  in  feature  selection  time  as  the  number  of  partitions  increases.  JRB’s  times 
are  higher  than  JR’s  because  POS  bias  significantly  increases  the  reduct  sizes.  In 
contrast,  IG  and  IGB  have  the  same  feature  selection  times,  which  decrease  by  54%  (3780 
seconds  to  1725  seconds)  as  the  number  of  partitions  is  increased  from  10  to  40.  As 
expected,  MRD  has  extremely  long  feature  selection  times  (99,843  sec.  at  10 
partitions  to  22,276  sec.  at  40  partitions;  not  shown  in  Figure  5),  and  MRDB  times  are 
even  longer.  However,  they  both  recorded  a  substantial  drop  in  feature  selection  time 
as  the  number  of  partitions  was  increased.  Therefore,  RTP  is  effective  in  reducing 
feature  selection  time  on  Reuters-21578 for  the  three  algorithms  we  tested. 
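
The relative reductions quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the helper name `percent_reduction` is ours, and the inputs are the rounded times (in seconds) given in the text, so the results may differ slightly from percentages computed on unrounded measurements.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# Feature selection times in seconds at 10 and 40 partitions (Reuters-21578),
# copied from the paragraph above.
reuters_times = {
    "JR": (510, 92),
    "JRB": (10382, 738),
    "IG/IGB": (3780, 1725),
    "MRD": (99843, 22276),
}

reductions = {
    name: percent_reduction(t10, t40) for name, (t10, t40) in reuters_times.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower at 40 partitions than at 10")
```

All four entries show a large drop, which is the basis for the statement that RTP reduces feature selection time on this data set.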

Results  with  the  10-News  Groups  Data  Set.  As  with  the  Reuters-21578  data  set,  we 
again  used  the  number  of  features  selected  by  JR  as  a  baseline  for  the  other 
algorithms.  It  selected  an  average  of  123,  134.75,  141.25,  and  153.5  features  at  10,  20, 
30,  and  40  partitions,  respectively. 

Comparison of the unbiased versions of the algorithms shows that IG attains
significantly higher accuracies than the others at all RTP levels on the 10-News
Groups data (see Figure 6). For example, at 30 partitions, IG outperformed JR (70.31%
vs. 51.74%, [p=.0005]) and MRD (70.31% vs. 57.82%, [p=.0005]). This contrasts with
its comparatively poor precision performance on the Reuters-21578 data set.

Figure 6. Classification accuracies (10-News Groups)

Comparing  the  two  rough  set  methodologies  with  each  other  reveals  that  MRD 
significantly  outperformed  JR  at  30  and  40  partitions  (e.g.,  57.82%  vs.  51.74%  at  30 
partitions,  [p=.022]).  This  finding  is  consistent  with  those  on  the  Reuters  data  set. 
However,  MRD’s  performance  could  not  be  objectively  compared  with  JR  at  10  and 
20  partitions  because  it  selected  fewer  features  than  JR  at  those  partitions. 

Comparing the algorithms’ biased and unbiased versions shows that JRB and MRDB attain
significantly higher classification accuracies than JR and MRD, respectively. For
example, JRB’s average accuracy is significantly higher than JR’s at 30 partitions
(74.68% vs. 51.74%, [p=.0006]) and MRDB outperforms MRD (61.47% vs. 57.82%,
[p=.022]). In contrast, IG was adversely affected by bias. That is, IG performed
slightly better than IGB (e.g., 70.31% vs. 69.36% at 30 partitions), although this
difference was small and statistically insignificant. Overall, JRB significantly
outperformed the other algorithms at 20, 30, and 40 partitions. For example, it
attained significantly higher average classification accuracies than IG (74.7% vs.
70.3% at 30 partitions [p=.0018]).

Figure 7. The effect of POS-bias on the number of noun features selected by the three
algorithms for 10-News Groups using 30 RTP partitions

One possible reason that bias did not help IG could be that we used the same POS bias
parameter settings for all the algorithms, but IG may require different settings. We
gained additional insight into this by examining the effect of POS bias on the
algorithms (see Figure 7). The unbiased versions of the algorithms selected different
proportions of noun features; JR selected 51%, IG selected 55%, and MRD selected 61%
at 30 partitions. Examining the biased versions shows that JRB selects 96%, while IGB
and MRDB select 100%, indicating that the bias factors may be too strong for IG and
MRD.

Analysis of the feature selection times shows that JR’s times steadily decrease from
325 seconds at 10 partitions to 60 seconds at 40 partitions and are the lowest among
all algorithms at 20-40 partitions (see Figure 8). Feature selection times for IG and
IGB remain relatively constant (268 seconds, on average) across different partition
sizes. In contrast, JRB’s feature selection times decreased dramatically from 10 to 20
partitions, but increased from 30 to 40. This occurred because the decrease in the
number of cases per partition is offset by larger increases in the reduct sizes,
thereby leading to an overall increase in feature selection times.

For the same reason, MRD and MRDB’s times steadily increase from 6291 seconds at 10
partitions to 10,134 seconds at 40 partitions (not shown in Figure 8). In general, MRD
selects more features than JR and this is further amplified for higher numbers of
partitions. Thus, RTP significantly reduces feature selection times for only JR and
JRB on the 10-News Groups data set.


Figure 8. Feature selection times (10-News Groups)
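
The trade-off described above (fewer cases per partition versus larger merged feature sets) can be illustrated with a small sketch of randomized training partitions. This is not the implementation evaluated in this paper: the function names are hypothetical, `toy_selector` is a stand-in for JR, MRD, or IG, and merging the per-partition results by set union is an assumption made for illustration, since the merge rule is not stated in this section.

```python
import random


def select_with_rtp(cases, labels, num_partitions, select_features, seed=0):
    """Run `select_features` on random partitions and merge the results.

    ASSUMPTION: per-partition feature sets are merged by set union.
    """
    indices = list(range(len(cases)))
    random.Random(seed).shuffle(indices)  # randomized assignment of cases
    selected = set()
    for p in range(num_partitions):
        part = indices[p::num_partitions]  # every num_partitions-th shuffled case
        if not part:
            continue  # more partitions than cases
        part_cases = [cases[i] for i in part]
        part_labels = [labels[i] for i in part]
        selected |= set(select_features(part_cases, part_labels))
    return selected


def toy_selector(cases, labels):
    """Hypothetical stand-in selector: keep terms that occur in only one class."""
    classes_by_term = {}
    for terms, label in zip(cases, labels):
        for term in terms:
            classes_by_term.setdefault(term, set()).add(label)
    return {term for term, seen in classes_by_term.items() if len(seen) == 1}


# Each case is represented as a set of terms (bag-of-words without counts).
cases = [{"oil", "price"}, {"oil", "export"}, {"wheat", "price"}, {"wheat", "crop"}]
labels = ["crude", "crude", "grain", "grain"]

whole = select_with_rtp(cases, labels, 1, toy_selector)
split = select_with_rtp(cases, labels, 2, toy_selector)
```

Under these assumptions, each call to the selector sees fewer cases as the number of partitions grows, which is where the time savings come from. The merged set can only stay the same or grow with this toy selector, because a term that looks class-specific within a small partition is kept even if it is ambiguous over the whole training set.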


Results Summary and Discussion. Given that one of the rough set methods, JR with
suitable POS bias, outperformed IG on both data sets, we partially accept our first
hypothesis, which claims that rough set methods significantly outperform IG. We also
confirmed our second hypothesis, which states that POS-bias has a positive effect on
RST feature selection algorithms. In particular, its effect on JR was substantial (6.1%
increase in precision on Reuters-21578, and 41.78% increase in accuracy on 10-News
Groups). Interestingly, the effect of POS-bias on IG was mixed: positive on
Reuters-21578 and negative on 10-News Groups. We conjecture that the reasons for this
mixed result are that the bias parameters for IG were too strong for the 10-News
Groups set and that IG effectively counters the inherent POS bias when the number of
cases per class is large (e.g., 1000 as opposed to 100).

We  showed  that  RTP  was  effective  in  dramatically  reducing  feature  selection 
time  for  JR.  However,  the  effect  of  RTP  on  MRD  was  mixed:  positive  on 
Reuters-21578  and  negative  on  10-News  Groups.  Therefore,  we  cannot  fully  confirm 
our  third  hypothesis  that  RTP  is  always  effective  in  reducing  training  time  for  rough 
set  methods.  However,  without  RTP  it  would  have  been  practically  infeasible  to  run 
MRD  and  JR.  We  also  observed  that  RTP  has  a  positive  effect  on  IG,  although  small 
compared  to  RST  methods.  This  is  because  increasing  the  number  of  partitions 
reduces  the  effective  vocabulary  that  IG  must  deal  with  and  IG’s  computational 
complexity  is  linearly  dependent  on  the  vocabulary  size. 
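
To make the linear dependence on vocabulary size concrete, the following is a minimal sketch of information gain over binary term-presence features. It is the textbook formulation, not necessarily the implementation used in our experiments; the function names and the toy data are illustrative assumptions. One score is computed per vocabulary term, and each score requires one pass over the cases.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (
        len(absent) / n
    ) * entropy(absent)
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Score every vocabulary term once; the work grows linearly with vocabulary size."""
    vocabulary = set().union(*docs)
    scores = {t: information_gain(docs, labels, t) for t in vocabulary}
    return sorted(scores, key=lambda t: (-scores[t], t))  # ties broken by name


# Each document is represented as a set of terms.
docs = [{"oil", "price"}, {"oil", "export"}, {"wheat", "price"}, {"wheat", "crop"}]
labels = ["crude", "crude", "grain", "grain"]
ranking = rank_terms(docs, labels)
```

In this toy example, the terms that separate the two classes perfectly rank first and the term shared by both classes ranks last. Fewer distinct terms per partition means fewer scores to compute.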


4  Related  Work 


TCBR  systems  have  been  designed  to  support  a  variety  of  applications  such  as  those 
involving  legal  reasoning  (Brüninghaus  &  Ashley,  2003),  spam  filtering  (Delany  et 
al.,  2005),  and  news  group  classification  (Wiratunga  et  al.,  2004).  Typically,  TCBR 
systems  that  use  knowledge-poor  approaches  (e.g.,  for  email  classification)  tend  to 
automatically  generate  features  and  operate  on  large  data  sets.  For  example,  Delany  et 
al.  (2005)  used  IG  to  select  features  in  a  spam  filtering  task  and  Wiratunga  et  al. 
(2004)  used  IG  to  select  features  with  boosted  decision  stumps.  However,  unlike  us, 
they  did  not  focus  on  reducing  the  computational  complexity  of  their  feature  selection 
algorithms.  Furthermore,  high  computational  complexity  was  not  a  limiting  factor 
because  their  binary  classification  task  is  not  particularly  demanding  of  information 
gain,  especially  given  that  their  case  bases  were  relatively  small,  containing  only 
about  1000  cases.  We  instead  investigate  multi-class  and  n-ary  classification 
tasks  involving  thousands  of  cases,  which  require  more  attention  to  computational 
complexity.  Despite  these  differences,  our  feature  selection  algorithms,  randomized 
training  partitions,  and  POS  biasing  can  be  effectively  integrated  with  their  approach. 

Given  a  set  of  manually  selected  features,  Brüninghaus  &  Ashley’s  (2003)  TCBR 
system  induces  a  set  of  classifiers  that  can  automatically  assign  features  to  text 
documents.  They  used  ID3  to  induce  these  classifiers.  If  the  number  of  features  is 
large,  ID3’s  performance  would  degrade  significantly.  In  such  situations,  our  feature 
selection  algorithms  could  significantly  improve  ID3’s  performance. 

While  RST-motivated  feature  selection  algorithms  have  recently  been  applied  to 
textual  case  bases  on  classification  tasks,  we  are  the  first  group  to  highlight 
complexity  issues  (Gupta  et  al.,  2005).  For  example,  Chouchoulas  &  Shen  (2001) 
applied  their  QuickReduct  method  for  email  classification.  While  QuickReduct’s 
complexity  (Gupta  et  al.,  2005)  is  high  (i.e.,  the  same  as  MRD),  they  did  not  address 
complexity  because  their  data  included  only  1500  cases.  Furthermore,  they  did  not 
compare  QuickReduct  with  any  conventional  feature  selection  algorithms,  such  as  IG. 


Li  et  al.  (2006)  developed  a  Fast  Rough  Set  Feature  Reduction  algorithm.  Unlike 
the  RST  algorithms  we  evaluated,  it  is  not  feasible  to  isolate  the  contributions  of  RST 
in  their  hybrid  conventional/RST  algorithm.  In  particular,  they  used  IG  to  rank-order 
the  features  for  selection  and  the  relative  dependency  metric  only  to  terminate  feature 
selection.  Finally,  they  did  not  compare  the  performance  of  their  algorithm  with 
conventional  algorithms. 

An  et  al.  (2004)  developed  a  rough  set  feature  selection  method  called  ELEM2 
and  applied  it  to  web  page  classification.  As  with  the  other  research  groups,  they  did 
not  address  complexity  issues  and  evaluated  their  algorithm  on  a  relatively  small  set 
of  327  web  pages.  Moreover,  they  tested  their  algorithm  only  with  the  most  frequently 
occurring  20,  30,  and  40  keywords  per  category.  Although  this  drastically  reduces 
their  data  set’s  number  of  features,  frequency-based  keyword  selection  is  not  always 
competitive  with  other  feature  selection  algorithms  (Yang  &  Pederson,  1997). 

In  our  previous  research  (Gupta  et  al.,  2005),  we  introduced  RST-motivated  feature 
selection  algorithms  for  a  multi-class  classification  task.  We  also  noted  that  the  high 
computational  complexity  of  feature  selection  algorithms  is  a  limiting  factor  and 
introduced  randomized  training  partitions  to  reduce  training  time.  Finally,  we  showed 
that  JohnsonsReduct  performed  comparably  to  IG  on  a  single  data  set.  In  this  paper, 
we  extended  JohnsonsReduct  to  work  with  multi-valued  features  and  introduced  the 
topic  of  fuzzy  discernibility.  In  addition,  we  introduced  MRD,  a  pure  rough  set 
version  of  Li  et  al.’s  (2006)  Fast  Rough  Set  Reduction  Approach.  While  this  increases 
computational  complexity,  it  is  offset  through  the  use  of  RTP.  We  also  improved  our 
evaluation  methodology.  For  example,  we  eliminated  variances  due  to  differences  in 
feature  weighting  by  weighting  all  features  equally,  added  a  single  classification  task 
to  improve  the  reliability  of  our  conclusions,  and  used  a  two-fold  cross  validation 
methodology  rather  than  random  sampling.  This  has  led  us  to  qualitatively  new 
results.  For  example,  we  found  randomized  training  partitions  to  be  effective  for  both 
rough  set  and  conventional  feature  selection  algorithms  (for  the  Reuters-21578  data 
set),  rather  than  only  for  the  former. 
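
For readers unfamiliar with it, the classical greedy heuristic usually attributed to Johnson for approximating a reduct can be sketched as follows. This is the textbook version for discrete features, shown only for orientation; it is not the multi-valued, fuzzy-discernibility extension described in this paper, and the function name and toy data are illustrative assumptions.

```python
from itertools import combinations


def johnson_reduct(cases, labels):
    """Greedy (Johnson-style) approximation of a reduct.

    cases: list of dicts mapping feature name -> discrete value.
    labels: list of class labels, one per case.
    """
    # Discernibility entries: for each pair of cases with different labels,
    # the set of features whose values differ between the two cases.
    entries = []
    for (a, label_a), (b, label_b) in combinations(zip(cases, labels), 2):
        if label_a != label_b:
            diff = {f for f in a.keys() | b.keys() if a.get(f) != b.get(f)}
            if diff:  # identical cases cannot be discerned by any feature
                entries.append(diff)

    reduct = []
    while entries:
        # Pick the feature that occurs in the most remaining entries.
        counts = {}
        for entry in entries:
            for feature in entry:
                counts[feature] = counts.get(feature, 0) + 1
        best = min(counts, key=lambda f: (-counts[f], f))  # ties broken by name
        reduct.append(best)
        # Drop every entry that the chosen feature already discerns.
        entries = [entry for entry in entries if best not in entry]
    return reduct


cases = [
    {"oil": 1, "price": 1, "wheat": 0},
    {"oil": 1, "price": 0, "wheat": 0},
    {"oil": 0, "price": 1, "wheat": 1},
    {"oil": 0, "price": 0, "wheat": 1},
]
labels = ["crude", "crude", "grain", "grain"]
selected = johnson_reduct(cases, labels)
```

In this toy example a single feature discerns every pair of cases from different classes, so the greedy loop stops after one selection. The pairwise construction of the discernibility entries is what makes the number of cases per partition matter for running time.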

Finally,  we  introduced  the  use  of  a  POS-bias  in  textual  case  bases  and  described 
why  it  can  impact  feature  selection.  This  explicit  manipulation  of  bias  appears  to  be 
novel;  we  are  not  aware  of  any  prior  research  on  using  background  knowledge  of  this 
type  to  assist  TCBR  systems  on  classification  tasks.  We  showed  that  biasing  feature 
selection  algorithms  can  significantly  increase  classification  accuracy  of  both 
conventional  and  RST-motivated  feature  selection  algorithms,  and  that  these  increases 
are  more  substantial  for  the  rough  set  algorithms. 


5  Conclusion 


Until  recently,  only  conventional  feature  selection  algorithms  (e.g.,  IG  and  its 
extensions)  had  been  applied  to  textual  CBR  with  little  concern  for  their 
computational  complexity.  In  this  paper,  we  rigorously  investigated  the  potential  of 
RST approaches to improve task performance and reduce feature selection times. We
considered two RST algorithms: (1) JR with lower computational complexity than IG
and (2) MRD with much higher computational complexity than IG. We evaluated the
effect on these algorithms of RTP, a method we introduced in our previous research
to dramatically reduce feature selection time. In addition, we introduced a novel idea
of part-of-speech bias in textual CBR that could affect both RST and conventional
approaches. Evaluation of these methodologies with large multi-class and n-ary
classification tasks showed that JR, suitably biased, significantly outperforms IG and
significantly benefits from RTP. Furthermore, POS bias significantly improved RST
feature selection algorithms.

Given  that  JR  significantly  outperformed  IG  on  our  data,  we  suspect  that 
Wiratunga  et  al.’s  (2004)  boosted  algorithm,  which  is  based  on  IG,  could  significantly 
benefit  from  our  methodologies.  We  also  conjectured  that  using  an  appropriate  POS 
bias  could  consistently  improve  IG,  and  that  IG  effectively  counters  bias  when  the 
number  of  cases  per  class  is  large.  In  our  future  work,  we  will  investigate  these 
conjectures. 